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Abstract — Assembling genomic sequences from a set 
of overlapping reads is one of the most fundamental 
problems in computational biology. Algorithms addressing 
the assembly problem fall into two broad categories - 
based on the data structures which they employ. The first 
class uses an overlap/string graph and the second type uses 
a de Bruijn graph. However with the recent advances in 
short read sequencing technology, de Bruijn graph based 
algorithms seem to play a vital role in practice. 

Efficient algorithms for building these massive de Bruijn 
graphs are very essential in large sequencing projects 
based on short reads. In JTJ, an 0(n/p) time parallel 
algorithm has been given for this problem. Here n is the 
size of the input and p is the number of processors. This 
algorithm enumerates all possible bi-directed edges which 
can overlap with a node and ends up generating 0(n£) 
messages. 

In this paper we present a Q(n/p) time parallel algo- 
rithm with a communication complexity equal to that of 
parallel sorting and is not sensitive to E. The generality 
of our algorithm makes it very easy to extend it even 
to the out-of-core model and in this case it has an 
optimal I/O complexity of Q( bi°s(m^b) )• We demonstrate 
the scalability of our parallel algorithm on a SGI/Altix 
computer. A comparison of our algorithm with that of 1 1 1 
reveals that our algorithm is faster. We also provide 
efficient algorithms for the bi-directed chain compaction 
problem. 

Index Terms — de Bruijn graph construction, parallel 
algorithms, out of core algorithms, sequence assembly 
algorithms, computational genomics 

I. Introduction 

The genomic sequence of an organism is a string 
from the alphabet S = {^4, T, G, C}. This string is also 
referred as the Deoxyribonucleic acid (DNA) sequence. 
DNA sequences exist as complementary pairs (A — T, 
G — C) due to the double strandedness of the underlying 
DNA structure. Several characteristics of an organism 
are encoded in its DNA sequence, thereby reducing the 



biological analysis of the organism to the analysis of its 
DNA sequence. Identifying the unknown DNA sequence 
of an organism is known as de novo sequencing and is 
of fundamental biological importance. On the other hand 
the existing sequencing technology is not mature enough 
to identify/read the entire sequence of the genome - 
especially for complex organisms like the mammals. 
However small fragments of the genome can be read with 
acceptable accuracy. The shotgun method employed in 
many sequencing projects breaks the genome randomly 
at several places and generates several small fragments 
{reads) of the genome. The problem of reassembling all 
the fragmented reads into a small sequence close to the 
original sequence is known as the Sequence Assembly 
(SA) problem. 

Although the SA problem seems similar to the Short- 
est Common Super string (SCS) problem, there are in 
fact some fundamental differences. Firstly, the genome 
sequence might contain several repeating regions. How- 
ever, in any optimal solution to the SCS problem we will 
not be able to find repeating regions - because we want 
to minimize the length of the solution string. In addition 
to the repeats, there are other issues such as errors in 
reads and double strandedness of the reads which make 
the reduction to SCS problem very complex. 

The literature on algorithms to address the SA problem 
can be classified into two broad categories. The first class 
of algorithms model a read as a vertex in a directed graph 
- known as the overlap graph 0. The second class of 
algorithms model every substring of length k (i.e., a k- 
mer) in a read as a vertex in a (subgraph of) a de Bruijn 
graph 0. 

In an overlap graph, for every pair of overlapping 
reads, directed edges are introduced consistent with the 
orientation of the overlap. Since the transitive edges 
in the overlap graph are redundant for the assembly 
process they are removed and the resultant graph is 



called the string graph (2]. The edges of the string graph 
are classified into optional, required and exact. The SA 
problem is reduced to the identification of a shortest walk 
in the string graph which includes all the required and 
exact constraints on the edges. Identifying such a walk 
- minimum S-walk - on the string graph is known to be 
NP-hard H. 

When a de Bruijn graph is employed, we model every 
substring of length k (i.e., a fc-mer) in a read as a vertex 
j3l . A directed edge is introduced between two /c-mers if 
there exists some read in which these two /c-mers overlap 
by exactly k — 1 symbols. Thus every read in the input 
is mapped to some path in the de Bruijn graph. The SA 
problem is reduced to a Chinese Postman Problem (CPP) 
on the de Bruijn graph, subject to the constraint that the 
resultant CPP tour include all the paths corresponding to 
the reads. This problem is also known to be NP-hard. 
Thus solving the SA problem exactly on both these graph 
models is intractable. 

Overlap graph based algorithms were found to per- 
form better (see i @ §) with Sanger based 
read methods. Sanger methods produce reads typically 
around 1000 base pairs long. However these can produce 
significant read errors. To overcome the issues with 
Sanger reads new read technologies such as the py- 
rosequencing (454sequencing) have emerged. These read 
technologies can produce reliable and accurate genome 
fragments which are very short (up to 100 base-pairs 
long). On the other hand short read technologies can 
increase the number of reads in the SA problem by a 
large magnitude. Overlap based graph algorithms do not 
scale well in practice since they represent every read as a 
vertex. De Bruijn graph based algorithms seem to handle 
short reads very efficiently (see Q) in practice compared 
to the overlap graph based algorithms. However the 
existing sequential algorithms [9] to construct these 
graphs are sub-optimal and require significant amounts 
of memory. This limits the applicability of these methods 
to large scale SA problems. In this paper we address 
this issue and present algorithms to construct large de 
Bruijn graphs very efficiently. Our algorithm is optimal 
in the sequential, parallel and out-of-core models. A 
recent work by Jackson and Aluru HI yielded parallel 
algorithms to build these de Bruijn graphs efficiently. 
They present a parallel algorithm that runs in 0(n/p) 
time using p processors (assuming that n is a constant- 
degree ploynomial in p). The message complexity of their 
algorithm is 0(nS). By message complexity we mean 
the total number of messages (i.e., /c-mers) communi- 
cated by all the processors in the entire algorithm. One 



of the major contributions of our work is to show that 
we can accomplish this in Q(n/p) time with a message 
complexity of 0(n). An experimental comparison of 
these two algorithms on an SGI Altix machine shows 
that our algorithm is considerably faster. In addition, our 
algorithm works optimally in an out-of-core setting. In 
particular, our algorithm requires only ®( B\og(M^B) ) ^ 
operations. 

The organization of the paper is as follows. In Sec- 
tion |II] we introduce some preliminaries and define a bi- 
directed de Bruijn graph formally. Section [III] discusses 
our main algorithm in a sequential setting. Section [V] 
and Section [Vj show how our main idea can easily be 
extended to parallel and out-of-core models optimally. 
In Section IV-AI we provide some remarks on the parallel 
algorithm of Jackson and Aluru ID. Section IVIII gives 
algorithms to perform the simplification operation on the 
bi-directed de Bruijn graph. Section IVIIII discusses how 
our simplified bi-directed de Bruijn graph algorithm can 
replace the graph construction algorithm in a popular 
sequence assembly program VELVET 0. Finally we 
present experimental results in Section JX] 

II. Preliminaries 

Let s E E n be a string of length n. Any substring sj 
(i.e., s[j]s[j + 1] . . . s[j + k - 1], n - k + 1 > j > 1) of 
length k is called a k— mer of s. The set of all k— mers 
of a given string s is called the k— spectrum of s and is 
denoted by S(s, k). Given a k— mer Sj, Sj denotes the 
reverse complement of Sj (e.g., if Sj = AAGTA then 
Sj = TACTT). Let < be the partial ordering among the 
strings of equal length, then Sj < Sj indicates that the 
string Si is lexicographically smaller than Sj. Given any 
k— mer Sj, let Sj be the lexicographically smaller string 
between Sj and Sj. We call Sj the canonical k— mer of 
Si. In other words, if Sj < s~i then Sj = Sj otherwise 
Sj = Sj. A k— molecule of a given k— mer Sj is a 
tuple (si,Si) consisting of the canonical k— mer of Sj 
and the reverse complement of the canonical k— mer. In 
the rest of this paper we use the terms positive strand 
and canonical k— mer interchangeably. Likewise the non- 
canonical k— mer is referred to as the negative strand. 

A bi-directed graph is a generalized version of a 
standard directed graph. In a directed graph every edge 
has only one arrow head (-> or <-). On the other 
hand in a bi-directed graph every edge has two arrow 
heads attached to it (<H>, <-<,[>-<! or >-[>). Let V 
be the set of vertices and E = {(vi,Vj,oi,02)\vi,Vj <E 
F A 01,02 G {O;^}} be the set of bi-directed edges 
in a bi-directed graph G(V, E). For any edge e = 
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Fig. 1. Bi-directed graph and bi-directed walks 



(vi,v u , 01,02) £ E, o\ = e[o + \ and o 2 = e[o~] refer 
to the orientations of the arrow heads on the vertices 
Vi and Vj, respectively. A walk W(vi,Vj) between two 
nodes Vi,Vj G V of a bi-directed graph G(V,E) is a 
sequence Vi, & ix , v ix ,e i2 ,v i2 , . . . , v im , e im+1 , Vj, such that 
for every intermediate vertex , 1 < I < m the 
orientation of the arrow head on the incoming edge 
adjacent on is opposite to the orientation of the arrow 
head on the out going edge. To make this clearer, let 
ej, , Vi t , ei l+1 be a sub-sequence in the walk W(vi,Vj). 
If e», = (v il _ 1 ,v it ,o 1 ,o 2 ),e il+1 = (v in v il+1 , o' x , o' 2 ) then 
for the walk to be valid it should be the case that o 2 ^ o[ . 
Figure [T^a) illustrates an example of a bi-directed graph. 
Figure QJ6) shows a simple bi-directed walk between the 
nodes A and E. Bi-directed walk between two nodes 
may not be simple. Figure [TJc) shows a bi-directed walk 
between A and E which is not simple - because B 
repeats twice. 

A de Bruijn graph D k {s) of order A; on a given string 
s is defined as follows. The vertex set V of D k {s) is 
defined as the k— spectrum of s (i.e. V = E(s,k)). We 
use the notation suf(vi,l) (pre(vi,l), respectively) to 
denote the suffix (prefix, respectively) of length / in the 



string Vi. Let the symbol o denote the concatenation op- 
eration between two strings. The set of directed edges E 
of D k {s) is defined as follows: E = {(vi,Vj)\suf(vi, k— 
1) = pre(vj,k — 1) A Vi[l] o suf{vi,k — 1) o Vj[k] G 
S(s, fe+1)}. We can also define de Bruijn graphs for sets 
of strings as follows. If S = {s±, s 2 . . . s n } is any set of 
strings, a de Bruijn graph B k (S) of order A; on 5 has 
V = U^ =1 S(si,fc) and E = {(vi,Vj)\suf(vi,k - 1) = 
pre(vj,k — 1) A 3/ : Vi[l] o suf(vi,k — 1) o Vj[k] € 
S(s/, k + 1)}. To model the double strandedness of the 
DNA molecules we should also consider the reverse 
complements (S = {si, £2 ■ ■ ■ s n }) while we build the 
de Bruijn graph. 

To address this a bi-directed de Bruijn graph BD k (SU 
S) has been suggested in j4). The set of vertices V of 
BD k (SU S) consists of all possible k— molecules from 
S U S. The set of bi-directed edges for BD k (S U S) 
is defined as follows. Let x, y be two k— mers which 
are next to each other in some input string z € S U S. 
Then an edge is introduced between the k— molecules 
Vi and Vj corresponding to x and y, respectively. Please 
note that two consecutive /c— mers in some input string 
always overlap by k — 1 symbols. The converse need 
not be true. The orientations of the arrow heads on the 
edges are chosen as follows. If both x and y are the 
positive strands in v-i and Vj, respectively, then an edge 
(vi,Vj,>,>) is introduced. If x is the positive strand in 
Vi and y is the negative strand in Vj an edge (vi,Vj,>,<) 
is introduced. Finally, if x is the negative strand in Vi 
and y is the positive strand in Vj an edge (vi,Vj, <,\>) 
is introduced. 

Figure [2] illustrates a simple example of the bi-directed 
de Bruijn graph of order k = 3 from a set of reads 
ATGG,CCAT,GGAC,GTTC,TGGA and TGGT 
observed from a DNA sequence ATGGACCAT 
and its reverse complement ATGGTCC AT . Consider 
two 3— molecules v\ = (GGA, TCC) and v-i = 
(GAC, GTC). Because the positive strand x = GGA 
in v\ overlaps the positive strand y = GAC in v 2 by 
string GA, an edge (v±, v%, >, D>) is introduced. Note 
that the negative strand GTC in v 2 also overlaps the 
negative strand TCC in v 2 by string TC, so the two 
overlapping strings GA and TC are drawn above the 
edge (vi,v 2 , >, t>) in Figure|2] A bi-directed walk on the 
example bi-directed de Bruijn graph as illustrated by the 
dash line is corresponding to the original DNA sequence 
with the first letter omitted TGGACCAT. We would 
like to remark that the parameter k is always chosen to be 
odd to ensure that the forward and reverse complements 
of a /c-mer are not the same. 
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ATGG, CCAT 
GGAC, GTTC 
TGGA, TGGT 
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Fig. 2. Bi-directed de Bruijn graph example 



III. Our algorithm to construct bi-directed 
de Bruijn graphs 

In this section we describe our algorithm BiConstruct 
to construct a bi-directed de Bruijn graph on a given 
set of reads. The following are the main steps in our 
algorithm to build the bi-directed de Bruijn graph. Let 
Rf = {n, r2 . . . r n }, Ti G S r be the input set of reads. 
Rj = {fi, f~2 ... r^} is a set of reverse complements. Let 
R* = R f U R f and i? fc+1 = U re/? .S(r, /c + 1). i? fe+1 is 
the set of all (k + l)-mers from the input reads and their 
reverse complements. 

• [STEP-1] Generate canonical edges: Let (x,y) = 
(z[l . . . k], z[2 . . . k+1]) be the k— mers correspond- 
ing to a (k + l)-mer z[l . . . k + 1] G R k+1 . Recall 
that x and y are the canonical /c— mers of x and 
y, respectively. Create a canonical bi-directed edge 
(vi, Vj, oi, 02 ) for each (fc + l)-mer as follows. 



(v i ,v j ,o 1 ,o 2 ) 



(x,y,>,>) 
(y, &,<,<) 



{x,y,<,>) 
(y,x, <,>) 



(£,&>,<) 
(y,x,[>,<) 



(x,y, <,<) 
(ft £,>,>) 



» = a;,y = y 
IF x < y, 
ELSE 

x x A y = y 
IF x < y 
ELSE 

x = x A y ^ y 
IF x < y 
ELSE 

x ^ x A y ^ y 
JF x < y, 
ELSE 



. [STEP-2] Reduce multiplicity: Sort all the bi- 
directed edges in [STEP-1], using radix sort. Since 
the parameter k is always odd this guarantees 
that a pair of canonical /c-mers have exactly one 
orientation. Remove the duplicates and record the 
multiplicities of each canonical edge. Gather all the 
unique canonical edges into an edge list £ . 

• [STEP-3] Collect bi-directed vertices: For each 
canonical bi-directed edge (vi,Vj, 01,02) G £, col- 
lect the canonical /c-mers Uj, vj into list V. Sort the 
list V and remove duplicates, such that V contains 
only the unique canonical £>mers. 

• [STEP-4] Adjacency list representation: The list 
£ is the collection of all the edges in the bi-directed 
graph and the list V is the collection of all the nodes 
in the bi-directed graph. It is now easy to use £ and 
generate the adjacency lists representation for the 
bi-directed graph. This may require one extra radix 
sorting step. 

IV. Analysis of the algorithm BiConstruct 

Theorem 1: Algorithm BiConstruct builds a bi- 
directed de Bruijn graph of the order k in 0(n) time. 
Here n is number of characters/symbols in the input. 

Proof: Without loss of generality assume that all 
the reads are of the same size r. Let N be the num- 
ber of reads in the input. This generates a total of 
(r — k)N (k + l)-mers in [STEP-1]. The radix sort 
needs to be applied at most 2A;log(|S|) passes, resulting 
in 2k log(|E|)(r — k)N operations. Since n = Nr is the 
total number of characters/symbols in the input, the radix 
sort takes 0(fcnlog(|E|)) operations assuming that in 
each pass of sorting only a constant number of symbols 
are used. If A;log(|S|) = O(logiV), the sorting takes 
only 0(n) time. In practice since N is very large in 
relation to k and |S|, the above condition readily holds. 
Since the time for this step dominates that of all the 
other steps, the runtime of the algorithm BiConstruct is 
G(n). ■ 

V. Parallel algorithm for building 
bi-directed de Bruijn graph 

In this section we illustrate a parallel implementation 
of our algorithm. Let p be the number of processors 
available. We first distribute N/p reads for each proces- 
sor. All the processors can execute [STEP-1 ] in parallel. 
In [STEP-2] we need to perform parallel sorting on the 
list £. Parallel radix/bucket sort -which does not use 
any all-to-all communications- can be employed to ac- 
complish this. For example, the integer sorting algorithm 



of Kruskal, Rudolph and Snir takes O ^ lo g/ p) j time. 
This will be 0(n/p) if n is a constant degree polynomial 
in p. In other words, for coarse-grain parallelism the run 
time is asymptotically optimal. In practice coarse-grain 
parallelism is what we have. Here n = N(r — k + 1). 
We call this algorithm Par-BiConstruct. 

Theorem 2: Algorithm Par-BiConstruct builds a bi- 
directed de Bruijn graph in time 0(n/p). The message 
complexity is 0(n). 

A. Some remarks on Jackson and Alum 's algorithm 

The algorithm of Jackson and Aluru JT| first identifies 
the vertices of the bi-directed graph - which they call 
representative nodes. Then for every representative node 
| £ | many-to-many messages were generated. These mes- 
sages correspond to potential bi-directed edges which 
can be adjacent on that representative node. A bi-directed 
edge is successfully created if both the representatives of 
the generated message exist in some processor, otherwise 
the edge is dropped. This results in generating a total 
of G(n|S|) many-to-many messages. The authors in the 
same paper demonstrate that communicating many-to- 
many messages is a major bottleneck and does not scale 
well. On the other had we remark that the algorithm 
BiConstruct does not involve any many-to-many com- 
munications and does not have any scaling bottlenecks. 

On the other hand the algorithm presented in their 
paper [T| can potentially generate spurious bi-directed 
edges. According to the definition 01 of the bi-directed 
de Bruijn graph in the context of SA problem, a 
bi-directed edge between two /c-mers/vertices exists 
iff there exists some read in which these two k- 
mers are adjacent. We illustrate this by a simple 
example. Consider a read n = AATGCATC. If 
we wish to build a bi-directed graph of order 3, 
then {AAT, ATG, TGC, GCA, CAT, ATC} form 
a subset of the vertices of the bi-directed graph. In 
this example we see that /c-mers AAT and ATC 
overlap by exactly 2 symbols. However there cannot 
be any bi-directed edge between them according 
to the definition - because they are not adjacent. 
On the other hand the algorithm presented in 0]] 
generates the following edges with respect to A;-mer 
AAT: { (AAT, ATA) , (AAT, ATG) , (AAT, ATT), 
(AAT, ATC)}. The edges (AAT, ATA) and 
(AAT, ATC) are purged since the £>mers ATA 
and ATC are missing. However bi-directed edges with 
corresponding orientations are established between 
ATG and ATC. Unfortunately (AAT, ATC) is a 
spurious edge and can potentially generate wrong 




Fig. 3. Problems with pointer jumping on bi-directed chains 

assembly solutions. In contrast to their algorithm 12 
our algorithm does not use all-to-all communications - 
although we use point-point communications. 

VI. Out of core algorithms for building 

BI-DIRECTED DE BRUIJN GRAPHS 

Theorem 3: There exists an out-of core algorithm to 
construct a bi-directed de Bruijn graph using an optimal 
number of I/O's. 

Proof: Sketch: Replace the radix sorting with an 
external R— way merge which takes only &( B\og(M^B) )• 
Where M is the main memory size, n is the sum of the 
lengths of all reads, and B is the block size of the disk. 

■ 

VII. Simplified bi-directed de Bruijn graph 

The bi-directed de Bruijn graph constructed in the 
previous section may contain several linear chains. These 
chains have to be compacted to save space as well as 
time. The graph that results after this compaction step is 
refered to as the simplified bi-directed graph. A linear 
chain of bi-directed edges between nodes u and v can 
be compacted only if we can find a valid bi-directed 
walk connecting u and v. All the /c-mers/vertices in a 
compactable chain can be merged into a single node, and 
relabelled with the corresponding forward and reverse 
complementary strings. In Figure [4] we can see that the 
nodes X\ and X% can be connected with a valid bi- 
directed walk and hence these nodes are merged into a 
single node. In practice the compaction of chains seems 
to play a very crucial role. It has been reported that 
merging the linear chains can reduce the number of 
nodes in the graph by up-to 30% Q. 

Although bi-directed chain compaction problem seems 
like a list ranking problem there are some fundamental 
differences. Firstly, a bi-directed edge can be traversed in 
both the directions. As a result, applying pointer jumping 
directly on a bi-directed graph can lead to cycles and 
cannot compact the bi-directed chains correctly. Figure [3] 
illustrates the first phase of pointer jumping. As we 
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Fig. 4. Issues with partially compacted bi-directed chains 



u+ v+ 




Fig. 5. Transforming bi-directed chain compaction to list ranking 



can see, the green arcs indicate valid pointer jumps 
from the starting nodes. However since the orientation 
of the node Y4 is reverse relative to the direction of 
pointer jumping a cycle results. In contrast, a valid bi- 
directed chain compaction would merge all the nodes 
between Y\ and Y5 since there is a valid bi-directed walk 
between Y\ and Y5. On the other hand, bi-directed chain 
compaction may result in inconsistent bi-directed edges 
and these edges should be recognised and removed. 
Consider a bi-directed chain in Figure [4} this chain 
contains two possible bi-directed walks - Y\ to I4 and 
X\ to X 3 . The walk from Y\ to Y4 (I4 to Yi) spells out 
a label ATAGGT(ACCTAT) after compaction. Once 
we perform this compaction the edge between Y4 and 
Z\ in the original graph is no longer valid - because the 
A;-mer on Z\ cannot overlap with the label ACCTAT. 
The same is true for the compaction of the bi-directed 
walk between X\ and X%. The redundant edges after 
compaction are marked in red. Since bi-directed chain 
compaction has a lot of practical importance efficient 
and correct algorithms are essential. 

We now provide algorithms for the bi-directed chain 
compaction problem. Our key idea here is to transform a 
bi-directed graph into a directed graph and then apply list 
ranking. Given a list of candidate canonical bi-directed 
edges, we apply a ListRankingTransform (see Figure |5]> 
which introduces two new nodes v + ,v~ for every node 
v in the original graph. Directed edges corresponding to 
the orientation are introduced. See Figure [5] 

Lemma 1: Let BG(V,E) be a bi-directed graph; let 
BG t (V t ,E t ) be the directed graph after applying Lis- 
tRankingTransform. Two nodes u, v G V are connected 
by a bi-directed path iff u + G V 1 (u~ G V*) is connected 
to one of v + (v~) or v~(v + ) by a directed path. 

Proof: We first prove the forward direction by 
induction on the number of nodes in the bi-directed 
graph. Consider the basis of induction when |V| = 2, 
let Vo , V\ G V. Clearly we are only interested when 
vo and v\ are connected by a bi-directed edge. By 
the definition of ListRankingTransform the Lemma in 
this case is trivially true. Now consider a bi-directed 
graph with |V| = n + 1 nodes, if the path between 
Vi,i < n and vj ,j<n does not involve node v n 
the lemma still holds by induction on the sub bi- 
directed graph BG(V — {v n },E). Now assume that 
Vi . . . v p , v n , v q . . . Vj is the bi-directed path between v\ 
and vj involving the node v n ; see Figure Ha). Also 
Figure Oa) shows how the transformed directed graph 
look like; observe the colors of bi-directed edges and 
corresponding directed edges. By induction hypothesis 



on the sub bi-directed paths vt . . . v p , v n and v n ,v q . . . Vj 
we have the following, vf is connected to vf or v~ by 
some directed path Pi (See Figure Ob); vf is connected 
to Vj~ or v~ by some directed path P2. We examine three 
possible cases depending on how the directed edge from 
Pi and P2 is incident on vf. In CASE-1 we have both 
Pi and P2 pointing into node «+. This implies that the 
orientation of the bi-directed edges in the original graph 
is according to Figure |6tb). In this case we cannot have a 
bi-directed walk involving the node v n , which contradicts 
our original assumption. Similarly CASE-2(Figure[6]c)) 
would also lead to a similar contradiction. Only CASE- 
3 would let node v n involve in a bi-directed walk. In 
this case vf will be connected to either vf or vj by 
concatenation of the paths Pi, P2. We can make a similar 
argument to prove the reverse direction. ■ 

A. Algorithm for bi-directed chain compaction 

We first identify a set of candidate bi-directed edges 
which can potentially form a chain. One possible crite- 
rion will be to include all the edges which are adjacent 
on bi-directed nodes with exactly one in and out degree. 
Each candidate bi-directed edge is transformed using 
ListRankingTranform and list ranking is applied on 
resultant set. As a consequence of the symmetry in 
ListRankingTransform we would see both forward and 
reverse complements of the compacted chains in the 
output. We can further canonicalize each chain and 
remove the duplicates by sorting. This results in unique 
bi-directed chains from the candidate bi-directed edges. 
Finally we report only the chains which are resultant 
of compaction of at least three bi-directed nodes. This 
removes all the inconsistent edges (see Figure [4]) from 
further consideration. As a consequence of Lemma []] all 
the bi-directed chains are correctly compacted. 

B. Analysis of bi-directed compaction on parallel and 
out-of-core models 

Let £1 be the list of candidate edges for compaction. 
To do compaction in parallel, we can use a Segmented 
Parallel Prefix on p processors to accomplish this in 
time 0(\2£i\/p + \og(p)). On the other hand list ranking 
can also be done out-of-core as follows. Without loss 
of generality we can treat the input for the list ranking 
problem as a set S of ordered tuples of the form (x,y). 
Given S we create a copy and call it S'. We now perform 
an external sort of S, S' with respect to y (i.e., using the 
y value of tuple (x, y) as the key) and x respectively. The 
two sorted lists are scanned linearly to identify tuples 
(x,y) € S, (x',y') € S' such that y = x' . These two 




Fig. 6. Proof that ListRankingTransform preserves bi-directed walk 
in the original graph. 



tuples are merged into a single tuple (x,y') and are added 
to a list £',. This process is now repeated on £\. Note 
that if the underlying graph induced by S\ does not have 
any cycles then \£[\ < \£i\/2; which means that the size 
of £[ geometrically decreases after every iteration. The 
I/O complexity of each iteration is dominated by the 
external sorting and hence bi-directed compaction can be 
accomplished out-of-core with 6(|£j|/J3 logM (\£i\/B)) 
I/O operations. 

Care should be taken to deal with cycles. There are 
two ways of dealing with cycles. One way is to first 
identify all the cycles in the bi-directed graph (generated 
in the previous section) and then break these cycles by 
removing appropriate edges. A second easy is to identify 
the cycles on the fly and break them. We employ the 
second approach. 

VIII. Improving the construction of the 

BI-DIRECTED DE BRUIJN GRAPH IN SOME PRACTICAL 
ASSEMBLERS 

In this section we briefly describe how our algo- 
rithms can be used to speedup some of the existing SA 
programs. As an example, we consider VELVET |9l . 
VELVET is a suite of programs - velveth and velvetg, 
which has recently gained acclamation in assembling 
short reads. VELVET program builds a simplified bi- 
directed graph from a set of reads. We now briefly 
describe the algorithm used in VELVET to build this 
graph. VELVET program puts all the A:-mers from the 
input into a hash table and then identifies the A;-mers 
which are present in at least 2 reads - this information 
is called the roadmap in VELVET'S terminology. The 
program then builds a de Bruijn graph using these k- 
mers. Finally it takes every read and threads it on these 
fc-mers. The worst case time complexity is 0(n log(n)) - 
assuming that k and |S| are constants. On the other hand 
since VELVET builds this graph entirely in-memory, 
this has some serious scalability problems especially on 
large scale assembly projects. However VELVET seems 
to have some very good assembly heuristics to remove 
errors and identify redundant assembly paths, etc. Thus 
our algorithm can act as a replacement to code in 
VELVET which performs in-memory graph construction. 

IX. Experimental results 

We have compared the performance of our algorithm 
and that of Jackson and Alum (T). We refer to the 
later algorithm as JA. To make this comparison fair, 
we have implemented the JA algorithm also on the 
same platform that our algorithm runs on. We have 
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TABLE I 

COMPARISION BETWEEN THE JA ALGORITHM AND OUR 
ALGORITHM 



used a SGI/Altix IA-64 machine with 64 nodes for all 
of our experiments. Our implementation uses MPI for 
communication between the processors. Table U shows 
the user and system times for both our algorithm and the 
JA algorithm. We can clearly see that the system time 
(communication time) is consistently higher for the JA 
algorithm. Also, as we move from one million to eight 
million reads the increase in the system time is quite 
significant for the JA algorithm (e.g., the system time 
for JA increases from 0.621 sec to 4.120 sec, which is 
almost a 7X increase. On the other hand there is only 
a 3X increase in our algorithm). The JA algorithm has 
a higher communication cost because it enumerates all 
the bi-directed edges and uses many-to-many messages 
to send to the right processor. 

The user time of our algorithm is also consistently 
superior compared to the user time of JA. This clearly is 
because we do much less local computations. In contrast, 
JA needs to do a lot of local processing which arises from 
processing all the received edges, removing redundant 
ones, and collecting the necessary edges to perform 
many-to-many communications. Since JA was taking a 
significant amount of time on for inputs larger than 8 mil- 
lion we have compared these algorithms only for input 
sizes up to 8 million. The experimental results reported 
in 12 start with at least 64 processors. We however show 
the scalablity of our algorithm upto 128 million reads in 
Table JI] Table JI] clearly demonstrates the scalability of 
our algorithm. We make our implementations available at 
http://trinity.engr.uconn.edu/~vamsik/ParBidirected.tgz 

X. Conclusions 

In this paper we have presented an efficient algo- 
rithm to build a bi-directed de Bruijn graph which is 



p 


user time 
(ticks) 


sys time 
(ticks) 


wall time 
(min:sec) 


READS=16777216 


2 


37147 


259 


1:14.02 


4 


37254 


85 


0:38.95 


8 


20217 


57 


0:21.90 


16 


16951 


55 


0:19.73 


32 


12901 


40 


0:16.38 


READS=33554432 


2 


148070 


1219 


2:42.66 


4 


99067 


677 


1:48.60 


8 


47319 


322 


0:55.41 


16 


17936 


135 


0:25.64 


32 


9973 


191 


0:17.55 


READS=67108864 


2 


340653 


2348 


6:18.77 


4 


240861 


1931 


4:14.57 


8 


153782 


1781 


2:39.18 


16 


64408 


804 


1:10.91 


32 


46659 


486 


0:53.32 


READS=134217728 


2 


770922 


5560 


15:00.42 


4 


471196 


4272 


8:29.62 


8 


314281 


3456 


5:17.65 


16 


135562 


2148 


2:21.83 


32 


82414 


950 


1:28.87 



TABLE II 

SCALABLILITY OF OUR ALGORITHM FOR UP TO 128 MILLION 
READS 



[5] X. Huang, J. Wang, S. Aluru, S. . Yang, and L. Hillier, "Pcap: A 
whole-genome assembly program," Genome research, vol. 13, 
no. 9, pp. 2164-2170, 2003, cited By (since 1996): 61. [Online]. 
Available: www.scopus. com] 

[6] E. W. Myers, G. G. Sutton, A. L. Delcher, I. M. Dew, D. P. Fa- 
sulo, M. J. Flanigan, S. A. Kravitz, C. M. Mobarry, K. H. I. Rein- 
ert, K. A. Remington, E. L. Anson, R. A. Bolanos, H. . Chou, 
C. M. Jordan, A. L. Halpern, S. Lonardi, E. M. Beasley, R. C. 
Brandon, L. Chen, P. J. Dunn, Z. Lai, Y. Liang, D. R. Nusskern, 
M. Zhan, Q. Zhang, X. Zheng, G. M. Rubin, M. D. Adams, and 
J. C. Venter, 'A whole-genome assembly of drosophila," Science, 
vol. 287, no. 5461, pp. 2196-2204, 2000. 

[7] S. Batzoglou, D. B. Jaffe, K. Stanley, J. Butler, S. Gnerre, 
E. Mauceli, B. Berger, J. P. Mesirov, and E. S. Lander, 'Arachne: 
A whole-genome shotgun assembler," Genome research, vol. 12, 
no. 1, pp. 177-189, 2002. 

[8] "Phrap assembler," [http://www.phrap.org/]. 

[9] D. R. Zerbino and E. Birney, "Velvet: Algorithms for de novo 
short read assembly using de bruijn graphs," Genome research, 
vol. 18, no. 5, pp. 821-829, 2008. 



a fundamental data structure for any sequence assembly 
program - based on Eulerian approach. Our algorithms 
are also efficient in parallel and out of core settings. 
These algorithms can be used in building large scale bi- 
directed de Bruijn graphs. Also, our algorithm does not 
employ any all-to-all communications in parallel setting 
and performs better than that of Jackson and Aluru HI. 
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